Conversation

@alexwilcoxson-rel (Contributor) commented Oct 17, 2025

Description

This change enables physical expression filter pushdown through the DeltaScan ExecutionPlan impl. By default, gather_filters_for_pushdown assumes that no filters can be pushed down, but since DeltaScan is a wrapper around the Parquet data source exec, we can push the filters through to it.

This is important for leveraging dynamic filter pushdown, for example from hash joins.

To verify these changes I did the following:

  • Created a 100 million row table of uuidv7 ID strings so it would have good statistics.
  • Created a small table of three of those uuidv7 IDs.
  • Analyzed a simple join: EXPLAIN ANALYZE SELECT * FROM small JOIN big ON small.id = big.id
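To see why time-ordered IDs like uuidv7 make min/max statistics effective, here is a small illustrative sketch (not delta-rs code; the IDs and group size are made up). It models Parquet row groups as chunks of a sorted ID column, keeping only each chunk's min/max, so a point lookup can skip every group whose range doesn't contain the value:

```rust
// Illustrative sketch: time-ordered IDs give each "row group" a tight,
// mostly disjoint [min, max] range, so point lookups prune almost everything.

fn prune_counts(probe: &str) -> (usize, usize) {
    // Hypothetical time-ordered IDs: monotonically increasing hex strings.
    let ids: Vec<String> = (0u32..1_000).map(|t| format!("{t:08x}-id")).collect();

    // Split into "row groups" of 100 rows and record min/max per group,
    // as Parquet column statistics do.
    let stats: Vec<(String, String)> = ids
        .chunks(100)
        .map(|g| (g.first().unwrap().clone(), g.last().unwrap().clone()))
        .collect();

    // A group can be skipped whenever the probe value falls outside [min, max].
    let matched = stats
        .iter()
        .filter(|(lo, hi)| lo.as_str() <= probe && probe <= hi.as_str())
        .count();
    (matched, stats.len() - matched)
}

fn main() {
    // Value 0x12c (300) lives in one group; the other nine are pruned.
    let (matched, pruned) = prune_counts("0000012c-id");
    println!("row groups matched: {matched}, pruned: {pruned}");
}
```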

Results (these are the metrics from the parquet scan of the big table):

Without pushdown

metrics=[
    output_rows=100000000,
    elapsed_compute=10ns,
    batches_split=0,
    bytes_scanned=1542517765,
    file_open_errors=0,
    file_scan_errors=0,
    files_ranges_pruned_statistics=0,
    num_predicate_creation_errors=0,
    page_index_rows_matched=0,
    page_index_rows_pruned=0,
    predicate_evaluation_errors=0,
    pushdown_rows_matched=0,
    pushdown_rows_pruned=0,
    row_groups_matched_bloom_filter=0,
    row_groups_matched_statistics=0,
    row_groups_pruned_bloom_filter=0,
    row_groups_pruned_statistics=0,
    bloom_filter_eval_time=218ns,
    metadata_load_time=393.613861ms,
    page_index_eval_time=218ns,
    row_pushdown_eval_time=218ns,
    statistics_eval_time=218ns,
    time_elapsed_opening=17.92454ms,
    time_elapsed_processing=65.071375545s,
    time_elapsed_scanning_total=101.423530818s,
    time_elapsed_scanning_until_data=2.46365709s
]

With pushdown

metrics=[
    output_rows=6327680,
    elapsed_compute=10ns,
    batches_split=0,
    bytes_scanned=98546937,
    file_open_errors=0,
    file_scan_errors=0,
    files_ranges_pruned_statistics=0,
    num_predicate_creation_errors=0,
    page_index_rows_matched=6327680,
    page_index_rows_pruned=672320,
    predicate_evaluation_errors=0,
    pushdown_rows_matched=0,
    pushdown_rows_pruned=0,
    row_groups_matched_bloom_filter=0,
    row_groups_matched_statistics=7,
    row_groups_pruned_bloom_filter=0,
    row_groups_pruned_statistics=93,
    bloom_filter_eval_time=2.246653ms,
    metadata_load_time=466.780613ms,
    page_index_eval_time=1.693939ms,
    row_pushdown_eval_time=218ns,
    statistics_eval_time=12.888981ms,
    time_elapsed_opening=464.314794ms,
    time_elapsed_processing=3.527617582s,
    time_elapsed_scanning_total=4.976550961s,
    time_elapsed_scanning_until_data=262.494821ms
]

Notice that output rows were reduced from 100 million to ~6 million, and cumulative scan time dropped from 101s to 5s. The various statistics and pruning metrics are also nonzero, indicating the scan was able to leverage the dynamic filter built from the left-hand side of the hash join.
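The mechanism behind the row reduction can be sketched as a toy model (this is not the DataFusion API; names and types here are invented for illustration). Once the hash join's build side is collected, its key set can be pushed into the probe-side scan as a filter, so the scan emits only potentially matching rows instead of everything:

```rust
use std::collections::HashSet;

// Toy model of dynamic filter pushdown: the scan applies the build side's
// key set, so downstream operators never see non-matching rows.
fn scan_with_filter(rows: &[u64], filter: Option<&HashSet<u64>>) -> Vec<u64> {
    rows.iter()
        .copied()
        .filter(|r| filter.map_or(true, |f| f.contains(r)))
        .collect()
}

fn main() {
    let big: Vec<u64> = (0..1_000_000).collect(); // probe side
    let small: HashSet<u64> = [3, 42, 999_999].into_iter().collect(); // build side keys

    let without = scan_with_filter(&big, None).len();
    let with_pushdown = scan_with_filter(&big, Some(&small)).len();
    println!("rows emitted without pushdown: {without}, with: {with_pushdown}");
}
```

The real implementation pushes physical expressions down to the Parquet exec, which prunes at row-group and page granularity using statistics rather than filtering row by row; this sketch only shows the row-reduction effect.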

Related Issue(s)

n/a

Documentation

Docs for the implemented method: gather_filters_for_pushdown

Also see the associated method, handle_child_pushdown_result; I am leaving that as the default impl.

@alexwilcoxson-rel force-pushed the enable-phsyical-filter-pushdown branch from 77b8ccd to 8ae819f on October 17, 2025 03:04
@github-actions bot added the binding/rust (Issues for the Rust crate) label on Oct 17, 2025
codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.76%. Comparing base (daa8a68) to head (e83f4a8).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/delta_datafusion/table_provider.rs 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3859      +/-   ##
==========================================
- Coverage   73.77%   73.76%   -0.02%     
==========================================
  Files         151      151              
  Lines       39176    39165      -11     
  Branches    39176    39165      -11     
==========================================
- Hits        28903    28890      -13     
- Misses       9001     9004       +3     
+ Partials     1272     1271       -1     

@roeap (Collaborator) left a comment

Not blocking, because I don't doubt the PR; asking purely for egoistic reasons, to learn 😆.

IIUC, this sees filters that come from the parent node and decides if it can push them down. We also have supports_filters_pushdown (among others) on the TableProvider. So my guess is that the planner will call this on the provider, which will ultimately determine what gets passed into the function we have here, right?

I am currently rewriting the table provider, since the current one has some organic-growth artifacts and a refactor to handle column mapping etc. seems like much more work. So I'm just trying to understand how these relate and where we want to end up.

@alexwilcoxson-rel (Contributor, Author)

My understanding is that the current supports_filters_pushdown is on the TableProvider. TableProvider::supports_filters_pushdown is used during logical plan optimization to decide whether filters can be pushed into the table scan; in delta-rs we then use the filters known at plan time to prune the delta log.

These new functions, gather_filters_for_pushdown and handle_child_pushdown_result, are used during physical planning and execution. For example, after the left (build) side of a hash join is collected, the filters representing the collected values can be pushed into the probe side, filtering it. This is all new functionality in DataFusion 50.

Implementing gather_filters_for_pushdown for DeltaScan means we consider all of the filters provided by the parent and determine what to push down based on what the child (the Parquet exec) supports.
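That delegation pattern can be sketched as a toy model. The method name mirrors the DataFusion 50 ExecutionPlan hook, but the trait, types, and filter representation below are simplified stand-ins, not the real API:

```rust
// Toy model of the pushdown negotiation: the default denies everything,
// a leaf scan accepts, and a thin wrapper defers to its child.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Support {
    Supported,
    Unsupported,
}

pub trait Exec {
    // Default behavior: a node claims it cannot absorb any parent filter.
    fn gather_filters_for_pushdown(&self, parent_filters: &[&str]) -> Vec<Support> {
        parent_filters.iter().map(|_| Support::Unsupported).collect()
    }
}

// Leaf that can evaluate predicates during the scan.
pub struct ParquetExec;
impl Exec for ParquetExec {
    fn gather_filters_for_pushdown(&self, parent_filters: &[&str]) -> Vec<Support> {
        parent_filters.iter().map(|_| Support::Supported).collect()
    }
}

// A thin wrapper like DeltaScan: rather than inheriting the deny-all default,
// it forwards the question to its child.
pub struct DeltaScan {
    pub child: ParquetExec,
}
impl Exec for DeltaScan {
    fn gather_filters_for_pushdown(&self, parent_filters: &[&str]) -> Vec<Support> {
        self.child.gather_filters_for_pushdown(parent_filters)
    }
}

fn main() {
    let scan = DeltaScan { child: ParquetExec };
    println!("{:?}", scan.gather_filters_for_pushdown(&["big.id in build-side ids"]));
}
```

Without the wrapper overriding the method, the deny-all default would stop the hash join's dynamic filter at the DeltaScan boundary even though the Parquet exec underneath could use it.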

@roeap roeap enabled auto-merge (squash) October 17, 2025 17:51
auto-merge was automatically disabled October 17, 2025 19:31

Head branch was pushed to by a user without write access

@alexwilcoxson-rel force-pushed the enable-phsyical-filter-pushdown branch from 8ae819f to e83f4a8 on October 17, 2025 19:31
@alexwilcoxson-rel (Contributor, Author)

Thanks @roeap, I rebased again; you may need to re-enable the merge.

@roeap roeap enabled auto-merge (squash) October 17, 2025 19:49
@roeap roeap merged commit 60efa03 into delta-io:main Oct 17, 2025
26 checks passed